{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_02_5_pandas_features.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# T81-558: Applications of Deep Neural Networks\n",
    "**Module 2: Python for Machine Learning**\n",
    "* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)\n",
    "* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 2 Material\n",
    "\n",
    "Main video lecture:\n",
    "\n",
    "* Part 2.1: Introduction to Pandas [[Video]](https://www.youtube.com/watch?v=wixHCvnvnsU&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_1_python_pandas.ipynb)\n",
    "* Part 2.2: Categorical Values [[Video]](https://www.youtube.com/watch?v=Fm7Ax23hDP0&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_2_pandas_cat.ipynb)\n",
    "* Part 2.3: Grouping, Sorting, and Shuffling in Python Pandas [[Video]](https://www.youtube.com/watch?v=tUhaD8xWd7k&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_3_pandas_grouping.ipynb)\n",
    "* Part 2.4: Using Apply and Map in Pandas [[Video]](https://www.youtube.com/watch?v=YNo_mg1RrkM&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_4_pandas_functional.ipynb)\n",
    "* **Part 2.5: Feature Engineering in Pandas for Deep Learning in PyTorch** [[Video]](https://www.youtube.com/watch?v=ezaVtM405Qs&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_5_pandas_features.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Google CoLab Instructions\n",
    "\n",
    "The following code ensures that Google CoLab is running the correct version of TensorFlow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: using Google CoLab\n"
     ]
    }
   ],
   "source": [
    "try:\n",
    "    COLAB = True\n",
    "    print(\"Note: using Google CoLab\")\n",
    "except:\n",
    "    print(\"Note: not using Google CoLab\")\n",
    "    COLAB = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2.5: Feature Engineering\n",
    "\n",
    "Feature engineering is an essential part of machine learning.  For now, we will manually engineer features.  However, later in this course, we will see some techniques for automatic feature engineering.  \n",
    "\n",
    "## Calculated Fields\n",
    "\n",
    "It is possible to add new fields to the data frame that your program calculates from the other fields.  We can create a new column that gives the weight in kilograms.  The equation to calculate a metric weight, given weight in pounds, is:\n",
    "\n",
    "$$ m_{(kg)} = m_{(lb)} \\times 0.45359237 $$\n",
    "\n",
    "The following Python code performs this transformation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mpg</th>\n",
       "      <th>weight_kg</th>\n",
       "      <th>cylinders</th>\n",
       "      <th>...</th>\n",
       "      <th>year</th>\n",
       "      <th>origin</th>\n",
       "      <th>name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>18.0</td>\n",
       "      <td>1589</td>\n",
       "      <td>8</td>\n",
       "      <td>...</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>chevrolet chevelle malibu</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15.0</td>\n",
       "      <td>1675</td>\n",
       "      <td>8</td>\n",
       "      <td>...</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>buick skylark 320</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>396</th>\n",
       "      <td>28.0</td>\n",
       "      <td>1190</td>\n",
       "      <td>4</td>\n",
       "      <td>...</td>\n",
       "      <td>82</td>\n",
       "      <td>1</td>\n",
       "      <td>ford ranger</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>397</th>\n",
       "      <td>31.0</td>\n",
       "      <td>1233</td>\n",
       "      <td>4</td>\n",
       "      <td>...</td>\n",
       "      <td>82</td>\n",
       "      <td>1</td>\n",
       "      <td>chevy s-10</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>398 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      mpg  weight_kg  cylinders  ...  year  origin                       name\n",
       "0    18.0       1589          8  ...    70       1  chevrolet chevelle malibu\n",
       "1    15.0       1675          8  ...    70       1          buick skylark 320\n",
       "..    ...        ...        ...  ...   ...     ...                        ...\n",
       "396  28.0       1190          4  ...    82       1                ford ranger\n",
       "397  31.0       1233          4  ...    82       1                 chevy s-10\n",
       "\n",
       "[398 rows x 10 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\n",
    "    \"https://data.heatonresearch.com/data/t81-558/auto-mpg.csv\", \n",
    "    na_values=['NA', '?'])\n",
    "\n",
    "df.insert(1, 'weight_kg', (df['weight'] * 0.45359237).astype(int))\n",
    "pd.set_option('display.max_columns', 6)\n",
    "pd.set_option('display.max_rows', 5)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Google API Keys\n",
    "\n",
    "Sometimes you will use external APIs to obtain data.  The following examples show how to use the Google API keys to encode addresses for use with neural networks.  To use these, you will need your own Google API key.  The key I have below is not a real key; you need to put your own there.  Google will ask for a credit card, but there will be no actual cost unless you use a massive number of lookups.  YOU ARE NOT required to get a Google API key for this class; this only shows you how.  If you want to get a Google API key, visit this site and obtain one for **geocode**.\n",
    "\n",
    "You can obtain your key from this link: [Google API Keys](https://developers.google.com/maps/documentation/embed/get-api-key)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "if 'GOOGLE_API_KEY' in os.environ:\n",
    "    # If the API key is defined in an environmental variable,\n",
    "    # the use the env variable.\n",
    "    GOOGLE_KEY = os.environ['GOOGLE_API_KEY']    \n",
    "else:\n",
    "    # If you have a Google API key of your own, you can also just\n",
    "    # put it here:\n",
    "    GOOGLE_KEY = 'REPLACE WITH YOUR GOOGLE API KEY'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other Examples: Dealing with Addresses\n",
    "\n",
    "Addresses can be difficult to encode into a neural network.  There are many different approaches, and you must consider how you can transform the address into something more meaningful.  Map coordinates can be a good approach.  [latitude and longitude](https://en.wikipedia.org/wiki/Geographic_coordinate_system) can be a useful encoding.  Thanks to the power of the Internet, it is relatively easy to transform an address into its latitude and longitude values.  The following code determines the coordinates of [Washington University](https://wustl.edu/):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The provided API key is invalid. \n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "\n",
    "address = \"1 Brookings Dr, St. Louis, MO 63130\"\n",
    "\n",
    "response = requests.get(\n",
    "    'https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}' \\\n",
    "    .format(GOOGLE_KEY,address))\n",
    "\n",
    "resp_json_payload = response.json()\n",
    "\n",
    "if 'error_message' in resp_json_payload:\n",
    "    print(resp_json_payload['error_message'])\n",
    "else:\n",
    "    print(resp_json_payload['results'][0]['geometry']['location'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "They might not be overly helpful if you feed latitude and longitude into the neural network as two features.  These two values would allow your neural network to cluster locations on a map.  Sometimes cluster locations on a map can be useful.  Figure 2.SMK shows the percentage of the population that smokes in the USA by state.\n",
    "\n",
    "**Figure 2.SMK: Smokers by State**\n",
    "![Smokers by State](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_6_smokers.png \"Smokers by State\")\n",
    "\n",
    "The above map shows that certain behaviors, like smoking, can be clustered by the global region. \n",
    "\n",
    "However, often you will want to transform the coordinates into distances.  It is reasonably easy to estimate the distance between any two points on Earth by using the [great circle distance](https://en.wikipedia.org/wiki/Great-circle_distance) between any two points on a sphere:\n",
    "\n",
    "The following code implements this formula:\n",
    "\n",
    "$$ \\Delta\\sigma=\\arccos\\bigl(\\sin\\phi_1\\cdot\\sin\\phi_2+\\cos\\phi_1\\cdot\\cos\\phi_2\\cdot\\cos(\\Delta\\lambda)\\bigr) $$\n",
    "\n",
    "$$ d = r \\, \\Delta\\sigma $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Google API error on: 1 Brookings Dr, St. Louis, MO 63130",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[5], line 43\u001b[0m\n\u001b[1;32m     40\u001b[0m address1 \u001b[39m=\u001b[39m \u001b[39m\"\u001b[39m\u001b[39m1 Brookings Dr, St. Louis, MO 63130\u001b[39m\u001b[39m\"\u001b[39m \n\u001b[1;32m     41\u001b[0m address2 \u001b[39m=\u001b[39m \u001b[39m\"\u001b[39m\u001b[39m3301 College Ave, Fort Lauderdale, FL 33314\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m---> 43\u001b[0m lat1, lng1 \u001b[39m=\u001b[39m lookup_lat_lng(address1)\n\u001b[1;32m     44\u001b[0m lat2, lng2 \u001b[39m=\u001b[39m lookup_lat_lng(address2)\n\u001b[1;32m     46\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m\"\u001b[39m\u001b[39mDistance, St. Louis, MO to Ft. Lauderdale, FL: \u001b[39m\u001b[39m{}\u001b[39;00m\u001b[39m km\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39mformat(\n\u001b[1;32m     47\u001b[0m         distance_lat_lng(lat1,lng1,lat2,lng2)))\n",
      "Cell \u001b[0;32mIn[5], line 31\u001b[0m, in \u001b[0;36mlookup_lat_lng\u001b[0;34m(address)\u001b[0m\n\u001b[1;32m     29\u001b[0m json \u001b[39m=\u001b[39m response\u001b[39m.\u001b[39mjson()\n\u001b[1;32m     30\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mlen\u001b[39m(json[\u001b[39m'\u001b[39m\u001b[39mresults\u001b[39m\u001b[39m'\u001b[39m]) \u001b[39m==\u001b[39m \u001b[39m0\u001b[39m:\n\u001b[0;32m---> 31\u001b[0m     \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\u001b[39m\"\u001b[39m\u001b[39mGoogle API error on: \u001b[39m\u001b[39m{}\u001b[39;00m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39mformat(address))\n\u001b[1;32m     32\u001b[0m \u001b[39mmap\u001b[39m \u001b[39m=\u001b[39m json[\u001b[39m'\u001b[39m\u001b[39mresults\u001b[39m\u001b[39m'\u001b[39m][\u001b[39m0\u001b[39m][\u001b[39m'\u001b[39m\u001b[39mgeometry\u001b[39m\u001b[39m'\u001b[39m][\u001b[39m'\u001b[39m\u001b[39mlocation\u001b[39m\u001b[39m'\u001b[39m]\n\u001b[1;32m     33\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mmap\u001b[39m[\u001b[39m'\u001b[39m\u001b[39mlat\u001b[39m\u001b[39m'\u001b[39m],\u001b[39mmap\u001b[39m[\u001b[39m'\u001b[39m\u001b[39mlng\u001b[39m\u001b[39m'\u001b[39m]\n",
      "\u001b[0;31mValueError\u001b[0m: Google API error on: 1 Brookings Dr, St. Louis, MO 63130"
     ]
    }
   ],
   "source": [
    "from math import sin, cos, sqrt, atan2, radians\n",
    "\n",
    "URL='https://maps.googleapis.com' + \\\n",
    "    '/maps/api/geocode/json?key={}&address={}'\n",
    "\n",
    "# Distance function\n",
    "def distance_lat_lng(lat1,lng1,lat2,lng2):\n",
    "    # approximate radius of earth in km\n",
    "    R = 6373.0\n",
    "\n",
    "    # degrees to radians (lat/lon are in degrees)\n",
    "    lat1 = radians(lat1)\n",
    "    lng1 = radians(lng1)\n",
    "    lat2 = radians(lat2)\n",
    "    lng2 = radians(lng2)\n",
    "\n",
    "    dlng = lng2 - lng1\n",
    "    dlat = lat2 - lat1\n",
    "\n",
    "    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlng / 2)**2\n",
    "    c = 2 * atan2(sqrt(a), sqrt(1 - a))\n",
    "\n",
    "    return R * c\n",
    "\n",
    "# Find lat lon for address\n",
    "def lookup_lat_lng(address):\n",
    "    response = requests.get( \\\n",
    "        URL.format(GOOGLE_KEY,address))\n",
    "    json = response.json()\n",
    "    if len(json['results']) == 0:\n",
    "        raise ValueError(\"Google API error on: {}\".format(address))\n",
    "    map = json['results'][0]['geometry']['location']\n",
    "    return map['lat'],map['lng']\n",
    "\n",
    "\n",
    "# Distance between two locations\n",
    "\n",
    "import requests\n",
    "\n",
    "address1 = \"1 Brookings Dr, St. Louis, MO 63130\" \n",
    "address2 = \"3301 College Ave, Fort Lauderdale, FL 33314\"\n",
    "\n",
    "lat1, lng1 = lookup_lat_lng(address1)\n",
    "lat2, lng2 = lookup_lat_lng(address2)\n",
    "\n",
    "print(\"Distance, St. Louis, MO to Ft. Lauderdale, FL: {} km\".format(\n",
    "        distance_lat_lng(lat1,lng1,lat2,lng2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Distances can be a useful means to encode addresses.  It would help if you considered what distance might be helpful for your dataset.  Consider:\n",
    "\n",
    "* Distance to a major metropolitan area\n",
    "* Distance to a competitor\n",
    "* Distance to a distribution center\n",
    "* Distance to a retail outlet\n",
    "\n",
    "The following code calculates the distance between 10 universities and Washington University in St. Louis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "School 'Princeton', distance to wustl is: 1354.4830895052746\n",
      "School 'Harvard', distance to wustl is: 1670.6297027161022\n",
      "School 'University of Chicago', distance to wustl is: 418.0815972177934\n",
      "School 'Yale', distance to wustl is: 1508.217831712127\n",
      "School 'Columbia University', distance to wustl is: 1418.2264083295695\n",
      "School 'Stanford', distance to wustl is: 2780.6829398114114\n",
      "School 'MIT', distance to wustl is: 1672.4444489665696\n",
      "School 'Duke University', distance to wustl is: 1046.7970984423719\n",
      "School 'University of Pennsylvania', distance to wustl is: 1307.19541200423\n",
      "School 'Johns Hopkins', distance to wustl is: 1184.3831076555425\n"
     ]
    }
   ],
   "source": [
    "# Encoding other universities by their distance to Washington University\n",
    "\n",
    "schools = [\n",
    "    [\"Princeton University, Princeton, NJ 08544\", 'Princeton'],\n",
    "    [\"Massachusetts Hall, Cambridge, MA 02138\", 'Harvard'],\n",
    "    [\"5801 S Ellis Ave, Chicago, IL 60637\", 'University of Chicago'],\n",
    "    [\"Yale, New Haven, CT 06520\", 'Yale'],\n",
    "    [\"116th St & Broadway, New York, NY 10027\", 'Columbia University'],\n",
    "    [\"450 Serra Mall, Stanford, CA 94305\", 'Stanford'],\n",
    "    [\"77 Massachusetts Ave, Cambridge, MA 02139\", 'MIT'],\n",
    "    [\"Duke University, Durham, NC 27708\", 'Duke University'],\n",
    "    [\"University of Pennsylvania, Philadelphia, PA 19104\", \n",
    "         'University of Pennsylvania'],\n",
    "    [\"Johns Hopkins University, Baltimore, MD 21218\", 'Johns Hopkins']\n",
    "]\n",
    "\n",
    "lat1, lng1 = lookup_lat_lng(\"1 Brookings Dr, St. Louis, MO 63130\")\n",
    "\n",
    "for address, name in schools:\n",
    "    lat2,lng2 = lookup_lat_lng(address)\n",
    "    dist = distance_lat_lng(lat1,lng1,lat2,lng2)\n",
    "    print(\"School '{}', distance to wustl is: {}\".format(name,dist))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3.9 (torch)",
   "language": "python",
   "name": "pytorch"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
